AITopics

Country:

North America > United States (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.13)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)
Workflow (0.67)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Epidemiology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(5 more...)

Neural Information Processing Systems, Feb-11-2026, 20:41:01 GMT

c62fe1daeb10814d33e5a33ba466ecaf-Paper-Conference.pdf

classifier, optimal roc curve, roc curve, (16 more...)

Country:

North America > United States > Ohio > Franklin County > Columbus (0.04)
North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Neural Information Processing Systems, Feb-9-2026, 07:35:59 GMT

6fd6b030c6afec018415662d0db43f9d-Supplemental.pdf

alignment loss, model specification, natural language inference, (13 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

arXiv.org Artificial Intelligence, Oct-24-2025

Stress-Testing Model Specs Reveals Character Differences among Language Models

Zhang, Jifan, Sleight, Henry, Peng, Andi, Schulman, John, Durmus, Esin

Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy, we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show that this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous examples of issues in current model specs, such as direct contradictions and interpretive ambiguities among several principles. Additionally, our generated dataset reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we report value prioritization patterns and how they differ across these models.

large language model, machine learning, natural language, (20 more...)

2510.07686

Genre: Research Report > New Finding (0.67)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Banking & Finance (1.00)
Health & Medicine > Therapeutic Area (0.68)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Neural Information Processing Systems, Oct-10-2025, 08:12:57 GMT

849b84c0038e5856f2887e5bfe6ced41-Paper-Conference.pdf

dataset, hdtwingen, lung cancer, (15 more...)

Country:

North America > United States (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.13)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)
Workflow (0.67)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Epidemiology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(5 more...)

Neural Information Processing Systems, Aug-18-2025, 19:48:56 GMT

On the consistent estimation of optimal Receiver Operating Characteristic (ROC) curve

We formally introduce the notion of optimal ROC curve over a general model space.

artificial intelligence, machine learning, roc curve, (18 more...)

Country:

North America > United States > Ohio > Franklin County > Columbus (0.04)
North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Neural Information Processing Systems, Aug-15-2025, 02:51:36 GMT

6fd6b030c6afec018415662d0db43f9d-Supplemental.pdf

alignment loss, model specification, natural language inference, (13 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

arXiv.org Artificial Intelligence, Aug-13-2025

LLM-BI: Towards Fully Automated Bayesian Inference with Large Language Models

Huang, Yongchao

A significant barrier to the widespread adoption of Bayesian inference is the specification of prior distributions and likelihoods, which often requires specialized statistical expertise. This paper investigates the feasibility of using a Large Language Model (LLM) to automate this process. We introduce LLM-BI (Large Language Model-driven Bayesian Inference), a conceptual pipeline for automating Bayesian workflows. As a proof-of-concept, we present two experiments focused on Bayesian linear regression. In Experiment I, we demonstrate that an LLM can successfully elicit prior distributions from natural language. In Experiment II, we show that an LLM can specify the entire model structure, including both priors and the likelihood, from a single high-level problem description. Our results validate the potential of LLMs to automate key steps in Bayesian modeling, enabling the possibility of an automated inference pipeline for probabilistic programming.

large language model, machine learning, natural language, (15 more...)

doi: 10.5281/zenodo.16756724

2508.083

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.90)

Sfeir, Georges, Nova, Gabriel, Hess, Stephane, van Cranenburgh, Sander

Can large language models assist choice modelling? Insights into prompting strategies and current models capabilities

arXiv.org Artificial Intelligence, Jul-30-2025

Large Language Models (LLMs) are widely used to support various workflows across different disciplines, yet their potential in choice modelling remains relatively unexplored. This work examines the potential of LLMs as assistive agents in the specification and, where technically feasible, estimation of Multinomial Logit models. We implement a systematic experimental framework involving thirteen versions of six leading LLMs (ChatGPT, Claude, DeepSeek, Gemini, Gemma, and Llama) evaluated under five experimental configurations. These configurations vary along three dimensions: modelling goal (suggesting vs. suggesting and estimating MNLs); prompting strategy (Zero-Shot vs. Chain-of-Thoughts); and information availability (full dataset vs. data dictionary only). Each LLM-suggested specification is implemented, estimated, and evaluated based on goodness-of-fit metrics, behavioural plausibility, and model complexity. Findings reveal that proprietary LLMs can generate valid and behaviourally sound utility specifications, particularly when guided by structured prompts. Open-weight models such as Llama and Gemma struggled to produce meaningful specifications. Claude 4 Sonnet consistently produced the best-fitting and most complex models, while GPT models suggested models with robust and stable modelling outcomes. Some LLMs performed better when provided with just the data dictionary, suggesting that limiting raw data access may enhance internal reasoning capabilities. Among all LLMs, GPT o3 was uniquely capable of correctly estimating its own specifications by executing self-generated code. Overall, the results demonstrate both the promise and current limitations of LLMs as assistive agents in choice modelling, not only for model specification but also for supporting modelling decisions and estimation, and provide practical guidance for integrating these tools into choice modellers' workflows.

large language model, machine learning, specification, (19 more...)

2507.2179

Country: Europe (0.46)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Nova, Gabriel, Hess, Stephane, van Cranenburgh, Sander

Delphos: A reinforcement learning framework for assisting discrete choice model specification

arXiv.org Artificial Intelligence, Jul-28-2025

We introduce Delphos, a deep reinforcement learning framework for assisting the discrete choice model specification process. Unlike traditional approaches that treat model specification as a static optimisation problem, Delphos represents a paradigm shift: it frames this specification challenge as a sequential decision-making problem, formalised as a Markov Decision Process. In this setting, an agent learns to specify well-performing model candidates by choosing a sequence of modelling actions - such as selecting variables, accommodating both generic and alternative-specific taste parameters, applying non-linear transformations, and including interactions with covariates - and interacting with a modelling environment that estimates each candidate and returns a reward signal. Specifically, Delphos uses a Deep Q-Network that receives delayed rewards based on modelling outcomes (e.g., log-likelihood) and behavioural expectations (e.g., parameter signs), and distributes rewards across the sequence of actions to learn which modelling decisions lead to well-performing candidates. We evaluate Delphos on both simulated and empirical datasets, varying the size of the modelling space and the reward function. To assess the agent's performance in navigating the model space, we analyse the learning curve, the distribution of Q-values, occupancy metrics, and Pareto fronts. Our results show that the agent learns to adaptively explore strategies to identify well-performing models across search spaces, even without prior domain knowledge. It efficiently explores large modelling spaces, concentrates its search in high-reward regions, and suggests candidates that define Pareto frontiers balancing model fit and behavioural plausibility. These findings highlight the potential of this novel adaptive, learning-based framework to assist in the model specification process.

machine learning, reinforcement learning, specification, (17 more...)